# ESE 345 Final Project Report

By Kenneth Procacci and Matthew Huber

#### Goals:

The main goal of this project was to implement a 4-stage pipelined multimedia unit for a set of instructions using VHDL. This instruction set is similar to those in the Sony Cell SPU and Intel SSE architectures. The project served as a reminder and an opportunity for us to work with and improve our VHDL skills. It was structured so that we worked on smaller components belonging to each stage and then combined them into the complete pipelined multimedia unit. The components included a multimedia ALU, an instruction buffer, a register file, a forwarding unit, and clock-edge-sensitive pipeline buffers. We found that testing and simulating each component separately in Aldec Active-HDL before using it in the completed design made debugging much easier.

The purpose of a pipelined processor design is to increase performance by allowing multiple computations to be performed simultaneously. In our design, for example, one stage completes in a single clock cycle (say 20 ns). Since an instruction passes through four stages to be fully read and executed, a single instruction takes 80 ns. If multiple instructions are executed sequentially, the total execution time is 80n ns for n instructions. Pipelining allows computations to run in parallel, since a different instruction can occupy each stage at the same time. As the first instruction finishes stage 1 and moves on to stage 2, a new instruction can enter stage 1, and so on. As a result, a single instruction still takes 80 ns, but each subsequent instruction completes only 20 ns after the one before it. The total execution time is therefore (80 + 20(n-1)) ns, or (20n + 60) ns.
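The timing argument above can be checked numerically with a short Python sketch. The function names are hypothetical, and the 4 stages and 20 ns cycle time are simply the example figures used in the paragraph above.

```python
def sequential_time_ns(n, stages=4, cycle_ns=20):
    """Total time when each instruction finishes every stage before the next one starts."""
    return n * stages * cycle_ns


def pipelined_time_ns(n, stages=4, cycle_ns=20):
    """Total time when a new instruction enters the pipeline on every clock cycle."""
    if n == 0:
        return 0
    # The first instruction needs `stages` cycles; each later one adds a single cycle.
    return (stages + n - 1) * cycle_ns


if __name__ == "__main__":
    for n in (1, 4, 64):
        print(n, sequential_time_ns(n), pipelined_time_ns(n))
```

For a full 64-instruction program, this gives 5120 ns sequentially against 1340 ns pipelined, which matches 20n + 60 for n = 64.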

## **Block Diagram:**



This diagram shows the 4-stage pipeline. Stage 1 is the Instruction Fetch (IF) stage, where the instruction buffer receives the machine instructions as input from a text file. The program counter is incremented on every clock cycle, and its value determines which instruction the instruction buffer outputs. Stage 2 is the Instruction Decode (ID) stage. Here the register module receives the 25-bit instruction containing the addresses of the registers to be read and the register to be written. It breaks the instruction into the components needed by the ALU, reads the data at each register address, and outputs those values to the ALU. The register module also takes in a reg\_write control signal, together with the address and data for the destination register currently being written in stage 4; these are used only if the write back stage (stage 4) asserts reg\_write. Stage 3 is the Execution (EX) stage, which contains the multimedia ALU. In this stage, each register value being operated on first passes through a multiplexer, which selects whether the value comes from the register file or from the write back stage of the pipeline. The control signal for each mux is set by the forwarding unit, which takes in the reg\_write signal and the 128-bit data value from the write back stage. During this stage, any necessary computations are performed on the register values that were read, and the output is a new value to be stored in the destination register rd. The final stage (stage 4) is the Write Back (WB) stage. This stage takes the 128-bit data to be written to rd and the 5-bit address of the register being written. The WB stage also determines whether the reg\_write signal should be asserted; that signal is fed back to the forwarding unit and the register file, which use it as described earlier.

#### **Assembler:**

The format of the 25-bit machine code instructions is as follows:

Load immediate:

| Bits  | 24 | 23-21      | 20-5             | 4-0 |
|-------|----|------------|------------------|-----|
| Field | 0  | load index | 16-bit immediate | rd  |

• Includes simals, simahs, simsls, simshs, slmals, slmahs, slmsls, slmshs

• Includes nop, shrhi, au, cntih, ahs, or, bcw, maxws, minws, mlhu, mlhss, and, invb, rotw, sfwu, sfhs

```
if instruction.startswith("li"):
    instr_list = instruction.rsplit(" ")
    mach_code = "0" + str(bin2(int(instr_list[2]))).zfill(3) + str(bin2(int(instr_list[3]))).zfill(16) + str
elif instruction.startswith("simals"):
    instr_list = instruction.rsplit(" ")
    mach_code = "10000" + str(bin2(int(instr_list[4]))).zfill(5) + str(bin2(int(instr_list[3]))).zfill(5) +
elif instruction.startswith("simahs"):
    instr_list = instruction.rsplit(" ")
    mach_code = "10001" + str(bin2(int(instr_list[4]))).zfill(5) + str(bin2(int(instr_list[3]))).zfill(5) +
elif instruction.startswith("simsls"):
    instr_list = instruction.rsplit(" ")
    mach_code = "10010" + str(bin2(int(instr_list[4]))).zfill(5) + str(bin2(int(instr_list[3]))).zfill(5) +
elif instruction.startswith("simshs"):
    instr_list = instruction.rsplit(" ")
    mach_code = "10011" + str(bin2(int(instr_list[4]))).zfill(5) + str(bin2(int(instr_list[3]))).zfill(5) +
elif instruction.startswith("slmals"):
    instr_list = instruction.rsplit(" ")
    mach_code = "10100" + str(bin2(int(instr_list[4]))).zfill(5) + str(bin2(int(instr_list[3]))).zfill(5) +
elif instruction.startswith("slmahs"):
    instr_list = instruction.rsplit(" ")
    mach_code = "10101" + str(bin2(int(instr_list[4]))).zfill(5) + str(bin2(int(instr_list[3]))).zfill(5) +
```

The assembler that produces the machine code inputs for the instruction buffer was written in Python. The snippet provided shows its structure: it is simply an if-else chain that checks for every instruction that could be fed into the pipeline. It converts the instructions from a MIPS-like assembly syntax to machine code, which is then fed to the instruction buffer.

The assembly instructions can be entered into the input file on separate lines for the assembler as follows:

Load immediate: li rd index imm

R4-instruction: instr rd rs1 rs2 rs3
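To make the encoding concrete, here is a minimal, table-driven sketch of how the load immediate and R4 instructions could be assembled. It is not the assembler shown above: the names `to_bits`, `assemble`, and `R4_OPCODES` are hypothetical, the R4 field order (rs3, rs2, rs1, rd) after the 5-bit prefix is inferred from the visible part of the snippet and the operand order listed above, and the `111` code for slmshs is an assumption that extends the visible pattern. Only non-negative operands are handled.

```python
# 3-bit opcodes following the "10" R4 prefix. The first six match the snippet
# in this report; "110" and "111" are assumed to continue the same pattern.
R4_OPCODES = {
    "simals": "000", "simahs": "001", "simsls": "010", "simshs": "011",
    "slmals": "100", "slmahs": "101", "slmsls": "110", "slmshs": "111",
}


def to_bits(value, width):
    """Format a non-negative integer as a zero-padded binary string."""
    if not 0 <= value < (1 << width):
        raise ValueError(f"{value} does not fit in {width} bits")
    return format(value, f"0{width}b")


def assemble(line):
    """Translate one assembly line into a 25-character bit string."""
    fields = line.split()
    mnemonic = fields[0]
    operands = [int(field) for field in fields[1:]]
    if mnemonic == "li":
        rd, index, imm = operands
        return "0" + to_bits(index, 3) + to_bits(imm, 16) + to_bits(rd, 5)
    if mnemonic in R4_OPCODES:
        rd, rs1, rs2, rs3 = operands
        return ("10" + R4_OPCODES[mnemonic]
                + to_bits(rs3, 5) + to_bits(rs2, 5)
                + to_bits(rs1, 5) + to_bits(rd, 5))
    raise ValueError(f"unsupported instruction: {mnemonic}")


if __name__ == "__main__":
    for source in ("li 0 0 100", "simsls 5 4 1 0"):
        print(source, "->", assemble(source))
```

A lookup table keeps the opcode assignments in one place, so supporting another R4 mnemonic would only mean adding a dictionary entry rather than another `elif` branch.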



## **Design Procedure and Testbenches:**

Instruction Buffer:

```
-- Instruction buffer (excerpt of the architecture body)
begin
    process (clock)
        file file_pointer : text;
        variable line_content : std_logic_vector(24 downto 0);
        variable line_num : line;
        variable j : integer := 0;
    begin
        file_open(file_pointer, "output.txt", READ_MODE);
        while ((not endfile(file_pointer)) and j < 64) loop
            readline(file_pointer, line_num);
            READ(line_num, line_content);
            instructions(j) <= line_content;
            j := j + 1;
        end loop;
        file_close(file_pointer);
    end process;

    process (clock)
        variable PC : integer := 0;
    begin
        if rising_edge(clock) then
            if (PC < 64) then
                instruction <= instructions(PC);
            end if;
            PC := PC + 1;
        end if;
    end process;
end behavioral;

-- Instruction buffer testbench
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.all;

entity instruction_buffer_tb is
end instruction_buffer_tb;

architecture tb_architecture of instruction_buffer_tb is
    -- stimulus signals
    signal clock : std_logic := '0';
    signal instruction : std_logic_vector(24 downto 0);

    constant period : time := 10 ns;
begin
    -- Unit Under Test port map
    UUT : entity instruction_buffer
        port map (
            clock => clock,
            instruction => instruction
        );

    clk : process  -- system clock
    begin
        for i in 0 to 1032 loop
            wait for period;
            clock <= not clock;
        end loop;
        wait;
    end process;
end tb_architecture;
```

The instruction buffer is stage 1 of the pipeline, the instruction fetch stage. It reads the text file containing the list of 25-bit machine code instructions and stores the instructions in an array. The program counter (PC) acts as the index into the array and increments on every clock cycle, and the buffer outputs the instruction at that index. The instruction buffer can hold up to 64 instructions. The testbench for the instruction buffer is very simple, and the waveform shows a new instruction being passed through on every clock cycle.

#### Waveform:



Register Module:


The register module is stage 2 of the pipeline, the instruction decode stage, and it has two functions. First, it receives three inputs from stage 4 of the pipeline (write back): a register address, the data to be written to that address, and a register write signal that must be asserted for the write to occur. The register module stores an array of 32 128-bit registers, which is read on every instruction and written when the reg\_write signal is asserted. The contents of this array are also written to a registers.txt file that can be viewed during or after simulation. Second, the register module reads in the 25-bit instruction and breaks it into the components needed by the ALU: the register addresses (whose values it reads), the opcode, the 2-bit field that identifies the instruction format, and the immediate value and load index for the load immediate instruction. All of these are then output to stage 3.
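As an illustration of the field extraction described above, the following Python sketch splits a 25-bit instruction word into named fields. The function name `decode` and the dictionary keys are hypothetical. The load immediate and R4 layouts follow the assembler section of this report; the layout used for the remaining format (an 8-bit opcode followed by rs2, rs1, and rd) is an assumption and is not confirmed by the text.

```python
def decode(instr):
    """Split a 25-bit instruction word (given as an int) into named fields."""
    if not 0 <= instr < (1 << 25):
        raise ValueError("instruction must fit in 25 bits")
    rd = instr & 0x1F
    if (instr >> 24) == 0:
        # Load immediate: bit 24 = 0, then load index, 16-bit immediate, rd.
        return {
            "format": "li",
            "load_index": (instr >> 21) & 0x7,
            "imm": (instr >> 5) & 0xFFFF,
            "rd": rd,
        }
    rs1 = (instr >> 5) & 0x1F
    rs2 = (instr >> 10) & 0x1F
    if (instr >> 23) == 0b10:
        # R4 format: "10", 3-bit opcode, rs3, rs2, rs1, rd.
        return {
            "format": "r4",
            "opcode": (instr >> 20) & 0x7,
            "rs3": (instr >> 15) & 0x1F,
            "rs2": rs2,
            "rs1": rs1,
            "rd": rd,
        }
    # Remaining format ("11" prefix): assumed 8-bit opcode, rs2, rs1, rd.
    return {
        "format": "r3",
        "opcode": (instr >> 15) & 0xFF,
        "rs2": rs2,
        "rs1": rs1,
        "rd": rd,
    }


if __name__ == "__main__":
    print(decode(0x1200485))  # machine code for "simsls 5 4 1 0"
```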

In this testbench, we apply several example inputs, including some instructions with the reg\_write signal asserted and some without.

#### Our waveform:



The register module also writes the register values into a text file of which a snippet is shown:

#### ALU:

Sample function code snippet: SignedIntegerMultiplySubtractLowWithSaturation (simsls)

The multimedia ALU is stage 3 of the pipeline, or the execution stage, and it essentially performs any computations necessary to output the resulting register value for a given operation. The inputs are the decoded components of the instruction from stage 2, and the input register values are the values read in the register module, unless data forwarding occurs, in which case one of the values (rs1, rs2, rs3) will be replaced by the value in the writeback stage (stage 4). The above code is an example of one of the implementations, the simsls instruction.
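For reference, the behavior of simsls can be modeled in a few lines of Python. This is a sketch rather than a translation of our VHDL: the helper names are hypothetical, registers are modeled as 128-bit Python integers, and the operand roles (rs1 minus the product of the low halfwords of rs2 and rs3, per 32-bit word) are taken from the worked example in the instruction execution section.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def saturate32(value):
    """Clamp an integer to the signed 32-bit range."""
    return max(-(1 << 31), min((1 << 31) - 1, value))


def simsls(rs1, rs2, rs3):
    """Per 32-bit word: rs1 - low16(rs2) * low16(rs3), with signed saturation."""
    result = 0
    for lane in range(4):
        shift = 32 * lane
        minuend = to_signed(rs1 >> shift, 32)
        factor_a = to_signed(rs2 >> shift, 16)
        factor_b = to_signed(rs3 >> shift, 16)
        difference = saturate32(minuend - factor_a * factor_b)
        result |= (difference & MASK32) << shift
    return result


if __name__ == "__main__":
    print(hex(simsls(50, 3, 100)))
```

With the inputs 50, 3, and 100 in the least significant word, the sketch yields -250 in that word, which is 0xFFFFFF06 in two's complement.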

#### Testbench:

In our comprehensive testbench, we include an example for every instruction. For simplicity, we change only the instr and opcode values and keep the register values the same.

#### Waveform:



This snippet of the waveform shows the first four R4-instructions, and you can see that the calculated value of rd for each instruction is correct for the given input register values.

## Forwarding and Multiplexing

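```
-- Multiplexer: selects the forwarded write back value when its control is asserted
process (ctrl1, ctrl2, ctrl3, val1, val2, val3, wbval)
begin
    if ctrl1 = '1' then
        val1_out <= wbval;
    else
        val1_out <= val1;
    end if;
    if ctrl2 = '1' then
        val2_out <= wbval;
    else
        val2_out <= val2;
    end if;
    if ctrl3 = '1' then
        val3_out <= wbval;
    else
        val3_out <= val3;
    end if;
end process;

-- Forwarding unit: compares each source address with the write back address.
-- reg_write is included in the sensitivity list so the controls also update
-- when only reg_write changes.
process (rs1_addr, rs2_addr, rs3_addr, rd_addr, reg_write)
begin
    if (rs1_addr = rd_addr and reg_write = '1') then
        ctrl1 <= '1';
    else
        ctrl1 <= '0';
    end if;
    if (rs2_addr = rd_addr and reg_write = '1') then
        ctrl2 <= '1';
    else
        ctrl2 <= '0';
    end if;
    if (rs3_addr = rd_addr and reg_write = '1') then
        ctrl3 <= '1';
    else
        ctrl3 <= '0';
    end if;
end process;
```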

The snippet above shows code from our multiplexer and from our forwarding unit. The multiplexer checks whether the values to be computed require an up-to-date value from a register that is currently being written. If the forwarding unit determines that an input register address used in stage 3 is the same as the address currently being written in stage 4, it sends the corresponding control signal to the multiplexer, which then substitutes the write back value for the affected register(s) so that the computation in stage 3 produces accurate results.
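The interaction between the two components can be summarized with a small behavioral model in Python. The function names are hypothetical and the model ignores timing; it only shows which operand values reach the ALU for a given write back state.

```python
def forwarding_controls(rs_addrs, wb_addr, reg_write):
    """Return one control flag per source register (rs1, rs2, rs3)."""
    return tuple(bool(reg_write) and addr == wb_addr for addr in rs_addrs)


def select_operands(values, controls, wb_value):
    """Model the multiplexers: take the write back value where the control is set."""
    return tuple(wb_value if ctrl else val for val, ctrl in zip(values, controls))


if __name__ == "__main__":
    # rs1 = register 4 is still being written by the previous instruction,
    # so the stale register file value (0 here) is replaced by the forwarded 50.
    controls = forwarding_controls((4, 1, 0), wb_addr=4, reg_write=True)
    print(controls, select_operands((0, 3, 100), controls, wb_value=50))
```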

#### **Buffers**

```
-- A buffer between the instruction fetch and instruction decode stages (stages 1 & 2)
-- On each clock cycle, the instruction read from the instruction buffer is simply
-- passed through to stage 2 (the register module)
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity if_id is
    port (
        clock           : in  std_logic;
        instruction_in  : in  std_logic_vector(24 downto 0);
        instruction_out : out std_logic_vector(24 downto 0)
    );
end if_id;

architecture structural of if_id is
begin
    process (clock, instruction_in)
    begin
        if rising_edge(clock) then
            instruction_out <= instruction_in;
        end if;
    end process;
end structural;

-- A buffer between the instruction decode and execution stages (stages 2 & 3)
-- On each clock cycle, each of the output values from the decoding stage is simply
-- passed through to the ALU for execution.
architecture structural of id_ex is
begin
    process (clk, instruction, val1, val2, val3, instr, imm, load_index,
             rd_add, rs1_addr, rs2_addr, rs3_addr, opcode)
    begin
        if rising_edge(clk) then
            instruction_out <= instruction;
            val1_out <= val1;
            val2_out <= val2;
            val3_out <= val3;
            instr_out <= instr;
            imm_out <= imm;
            rs1_add <= rs1_addr;
            -- ... (load index and remaining address assignments)
            opcode_out <= opcode;
        end if;
    end process;
end structural;

-- A buffer between the execution and write back stages (stages 3 & 4)
-- The resulting output register value from the ALU is passed through
-- on every clock cycle. This buffer outputs that value, along with a
-- reg_write signal and register address to write to.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ex_wb is
    port (
        rd_add    : in  std_logic_vector(4 downto 0);
        instr     : in  std_logic_vector(24 downto 0);
        val       : in  std_logic_vector(127 downto 0);
        clk       : in  std_logic;
        val_out   : out std_logic_vector(127 downto 0);
        reg_write : out std_logic;
        rd_addr   : out std_logic_vector(4 downto 0)
    );
end ex_wb;

architecture structural of ex_wb is
begin
    process (clk, instr, val, rd_add)
    begin
        if rising_edge(clk) then
            --report "rd_addr : " & to_string(rd_add);
            --report "val : " & to_string(val);
            --report "instr : " & to_string(instr);
            val_out <= val;
            rd_addr <= rd_add;
            -- will only not write new value if the instruction is nop
            reg_write <= '0' when instr = "1100000000000000000000000" else '1';
        end if;
    end process;
end structural;
```

The above buffers are used to address the timing issues in having multiple stages. These buffers essentially will read in the output from a previous stage, and on each clock cycle (rising edge), they will output those values to the next stage. For example, the instruction that is output from the instruction buffer (stage 1) can only be read into stage 2 on a rising clock edge, since the IF/ID buffer lies between them, and will only send the value once per clock cycle.

## Four-Stage Pipelined Multimedia ALU

```
begin
    -- input clk signal, output instruction goes to if_id
    u1 : instruction_buffer port map (clock => clk, instruction => instr_buff);

    -- input instruction from instruction_buffer, output instruction goes to register_module
    -- ... (if_id port map)

    -- input instruction from if_id, input data for writing comes from ex_wb,
    -- outputs decoded instruction
    -- ... (register_module port map)

    -- inputs decoded instruction from register_module, outputs decoded instruction to multimedia_alu
    -- outputs rs1, rs2, rs3 addresses to forwarding unit and values to multiplexer
    -- ... rd_add => rd_add, opcode => opc, load_index => load_in, imm => imm, instr => instr_id ...
```

The above snippet is from our source code for the overall four-stage pipelined multimedia unit. It shows how ports are mapped to internal signals to connect the components of this project to one another. At the bottom are the output ports of this unit, which are used to produce the waveform shown below.



As you can see from the image above, we have a pipelined multimedia unit with four stages. Each instruction fed into the instruction buffer at the first stage is stepped through the next three stages, one per clock cycle.
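The one-stage-per-cycle movement described above can be mimicked with a simple shift-register model in Python. This is only an illustration with hypothetical names; it tracks which instruction occupies each stage and does not model forwarding or register writes.

```python
def simulate(program, stages=4):
    """Return one tuple per clock cycle listing the instruction in each stage."""
    if not program:
        return []
    pipeline = [None] * stages
    history = []
    # The last instruction needs stages - 1 extra cycles to drain.
    for cycle in range(len(program) + stages - 1):
        incoming = program[cycle] if cycle < len(program) else None
        pipeline = [incoming] + pipeline[:-1]  # every instruction advances one stage
        history.append(tuple(pipeline))
    return history


if __name__ == "__main__":
    for cycle, state in enumerate(simulate(["li", "simals", "simahs"]), start=1):
        print(f"cycle {cycle}: {state}")
```

Each printed row corresponds to one clock cycle, with stage 1 on the left and stage 4 on the right; `None` marks a stage that holds no instruction.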

```
-- Unit Under Test port map
UUT : entity four_stage_pipeline
    port map (
        clk => clk,
        reset => reset,
        instruction => instruction,
        control1 => control1,
        control2 => control2,
        control3 => control3,
        rs1_val => rs1_val,
        rs2_val => rs2_val,
        rs3_val => rs3_val,
        rd_val => rd_val,
        write_value => write_value,
        reg_write => reg_write
    );

clock : process  -- system clock
    file file_pointer : text;
    variable line_num : line;
    variable j : integer := 1;
begin
    for i in 0 to 1032 loop
        wait for period;
        clk <= not clk;
        write(line_num, "cycle: " & to_string(j) & ":");
        writeline(file_pointer, line_num);
        write(line_num, " Stage 1: instr = " & to_string(instruction));
        writeline(file_pointer, line_num);
        -- ... (similar write/writeline pairs for stages 2 through 4)
```

Above is the testbench written to verify the completed four stage pipelined multimedia ALU. It simply generates a clock waveform to be used and writes relevant values to a results.txt file, which can be seen below.

Above is the results file produced by the testbench for the multimedia unit, which shows the status of each individual stage at every clock cycle. The first stage shows the 25-bit input instruction into the IF/ID buffer. The second stage shows the instruction at this point, along with all the input register values. The third stage shows the instruction at this point, along with the resulting register value after performing some ALU operation. The fourth stage shows the write enable signal, the value to be written to the register file, and the address of the register.



Above is the register file produced to show the state of each of the 32 registers at the end of all operations. This file can be compared with an expected results file to verify that everything is working as intended.

#### **Instruction execution:**

Load Immediate: li 0 0 100



As you can see, the first load immediate instruction loads the value 100 (0x64) into the halfword at index 0 of register 0. The instruction moves through the 4 stages, computing the correct register value in stage 3 and writing the value to the correct register.
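The effect of load immediate on a 128-bit register can be sketched in Python as follows. The function name is hypothetical, and the sketch assumes that the other seven halfwords of rd are left unchanged, which the text above does not state explicitly.

```python
def load_immediate(rd_value, index, imm):
    """Place a 16-bit immediate into halfword `index` (0-7) of a 128-bit value."""
    if not 0 <= index <= 7:
        raise ValueError("load index must be between 0 and 7")
    shift = 16 * index
    cleared = rd_value & ~(0xFFFF << shift)  # zero the target halfword
    return (cleared | ((imm & 0xFFFF) << shift)) & ((1 << 128) - 1)


if __name__ == "__main__":
    print(hex(load_immediate(0, 0, 100)))  # corresponds to "li 0 0 100"
```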

## Signed Integer Multiply-Add Low with Saturation: simals 3 2 1 0



This operation should read the values from registers 0, 1, and 2, multiply the low 16 bits of every word field in registers 0 and 1, add the resulting word to the respective word fields of register 2, and store the result in register 3. For example, using the least significant word, 100\*3+50 = 350, which is 0x15E.

## Signed Integer Multiply-Add High with Saturation: simahs 4 2 1 0



Using the same input registers and still looking at the least significant word, the high 16 bits in registers 0 and 1 are both 0, and the value of the word in register 2 is 50, so 0\*0+50 = 50 (0x32).

## Signed Integer Multiply-Subtract Low with Saturation: simsls 5 4 1 0



Using the input registers rs1=4, rs2=1, rs3=0, data forwarding occurs because register 4 is still being written by the previous instruction. Looking at the least significant word, the low 16 bits of rs2 hold 3, the low 16 bits of rs3 hold 100, and the word in rs1 holds 50. The result is 50-100\*3 = -250 (0xFFFFFF06).

Signed Integer Multiply-Subtract High with Saturation: simshs 6 4 1 0

[Waveform for simshs 6 4 1 0. Signals shown: clk, reset, instruction, rs1\_val, rs2\_val, rs3\_val, instruction3, rd\_val, control1, control2, control3, write\_address, write\_value, reg\_write.]

Using the same input registers as the previous instruction, and still looking at the least significant word of each register, the high 16 bits of registers 0 and 1 are 0, so the result is 50-0\*0 = 50 (0x32).

Signed Long Integer Multiply-Add Low with Saturation: slmals 7 2 1 0



Signed Long Integer Multiply-Add High with Saturation: slmahs 8 2 1 0



With the same inputs, and looking at the low 64-bit fields, the high 32 bits in registers 0 and 1 are 0, so the result is 50+0\*0 = 50 (0x32).

Signed Long Integer Multiply-Subtract Low with Saturation: slmsls 9 2 1 0



Looking at the low 64-bit fields again, the low 32 bits in registers 0 and 1 are 100 and 3, respectively, and the 64-bit field in register 2 is 50. The result is 50-100\*3 = -250 (0xFFFFFFFFFFFFFF06).

Signed Long Integer Multiply-Subtract High with Saturation: slmshs 10 2 1 0



Looking at the high 64-bit fields this time, the high 32 bits in both registers 0 and 1 are 32767, and the 64-bit field in register 2 is 0x7FFFFFE00010000, so the result is (0x7FFFFFE00010000)-32767\*32767 = 0x7FFFFFDC001FFFF.

## Shift Right Halfword Immediate: shrhi 11 1 0



The least significant 4 bits of register 0 are 0b0100, which is 4, so each halfword in register 1 is shifted right by 4 bits. For example, the least significant halfword is 0x0003, which becomes 0x0000 after shifting. The second most significant halfword, 0x7FFF, becomes 0x07FF.
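The halfword shift can be modeled with the short Python sketch below. The function name mirrors the mnemonic but the code is only illustrative; it assumes a logical (zero-filling) shift, which agrees with both examples above but is not stated in the text.

```python
def shrhi(rs1, rs2):
    """Shift each 16-bit halfword of rs1 right by the low 4 bits of rs2."""
    shift = rs2 & 0xF
    result = 0
    for lane in range(8):
        halfword = (rs1 >> (16 * lane)) & 0xFFFF
        result |= (halfword >> shift) << (16 * lane)  # zeros are shifted in
    return result


if __name__ == "__main__":
    print(hex(shrhi(0x7FFF0003, 100)))  # 100 = 0x64, so the shift amount is 4
```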

## Add Word Unsigned: au 12 2 0



This is a standard addition of 32-bit fields. Looking at the least significant word, the sum of 50 (0x32) and 100 (0x64) is 150 (0x96).

## Count 1's in halfword: cntih 13 1 0



Only register 1 is used for this function, as rs1. The number of '1' bits in each halfword is counted, and the count is stored in the corresponding word in register 13. For instance, the least significant halfword is 0x0003, or 0b0000000000000011, which contains two '1' bits, so the value 2 (0x0002) is stored in the resulting register.
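A bit-count model of this instruction takes only a few lines of Python. The function name is hypothetical, and the sketch assumes that each count is written back to the same 16-bit position as the halfword it was computed from.

```python
def count_ones_halfwords(rs1):
    """Replace each 16-bit halfword of a 128-bit value with its number of 1 bits."""
    result = 0
    for lane in range(8):
        halfword = (rs1 >> (16 * lane)) & 0xFFFF
        result |= bin(halfword).count("1") << (16 * lane)
    return result


if __name__ == "__main__":
    print(hex(count_ones_halfwords(0x7FFF0003)))
```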

#### Bitwise or: or 14 1 0

*[Simulation waveform for `or 14 1 0`. Signals shown: clk, reset, rs1_val, rs2_val, rs3_val, Instruction3, rd_val, control1, control2, control3, reg_write.]*

This is a standard bitwise OR, where every bit in register 1 is ORed with the corresponding bit in register 0, and you can see that the resulting register value is correct.





The least significant word of register 1 (0x00000003) is broadcast to all 4 words in register 15.

## Max signed word: maxws 16 1 0



For each of the 4 words in registers 1 and 0, the maximum value is stored in the corresponding word in register 16. For example, the least significant word in register 1 holds 3 and register 0 holds 100, so the value stored in register 16 is 100.

## Min signed word: minws 17 1 0



Similar to maxws, this takes the minimum value for each word between the two registers. In the case of the least significant word again, 3 is the minimum of 3 and 100, so register 17 stores the value 3.
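Both maxws and minws can be sketched with one Python helper that compares words as signed 32-bit values. This is a reference model only, not the project's VHDL, and the names `select_signed_words`, `max_signed_words` and `min_signed_words` are hypothetical.

```python
WORD_MASK = 0xFFFFFFFF


def to_signed32(word: int) -> int:
    """Interpret a 32-bit pattern as a two's-complement signed integer."""
    return word - (1 << 32) if word & 0x80000000 else word


def select_signed_words(rs1: int, rs2: int, pick) -> int:
    """Apply pick (max or min) to each pair of 32-bit words, compared as signed."""
    result = 0
    for i in range(4):
        a = (rs1 >> (32 * i)) & WORD_MASK
        b = (rs2 >> (32 * i)) & WORD_MASK
        result |= pick(a, b, key=to_signed32) << (32 * i)
    return result


def max_signed_words(rs1: int, rs2: int) -> int:
    return select_signed_words(rs1, rs2, max)


def min_signed_words(rs1: int, rs2: int) -> int:
    return select_signed_words(rs1, rs2, min)


print(max_signed_words(3, 100), min_signed_words(3, 100))  # 100 3
```

The signed interpretation matters for words with the top bit set: 0xFFFFFFFF is treated as -1, so it loses to 1 in maxws and wins in minws.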

## Multiply low halfword unsigned: mlhu 18 1 0



The low 16 bits of each word in registers 0 and 1 are multiplied, and the product is stored in the corresponding word in register 18. For example, in the least significant word, registers 0 and 1 store the values 100 and 3 respectively, so the product is 300 (0x12C).
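A Python reference model of this multiplication is sketched below; `multiply_low_halfwords_unsigned` is a hypothetical name. Since two unsigned 16-bit operands give at most a 32-bit product, each result fits in its word without truncation.

```python
def multiply_low_halfwords_unsigned(rs1: int, rs2: int) -> int:
    """Multiply the low 16 bits of each 32-bit word pair into a 32-bit product."""
    result = 0
    for i in range(4):
        a = (rs1 >> (32 * i)) & 0xFFFF  # low halfword only; upper 16 bits ignored
        b = (rs2 >> (32 * i)) & 0xFFFF
        result |= (a * b) << (32 * i)
    return result


print(hex(multiply_low_halfwords_unsigned(3, 100)))  # 0x12c
```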

## Add halfword saturated: ahs 19 17 16



Since there was a load immediate instruction for register 16 immediately before this, data forwarding occurs for rs2. The halfwords of registers 16 and 17 are summed and stored in register 19. The least significant halfwords store values 100 and 3, so the sum is 103 (0x67). In the case of the second most significant halfwords, the sum of 32767 + 32767 exceeds the maximum value for a 16-bit signed value, so it is saturated to 0x7FFF.
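The saturation rule can be written out as a short Python reference model. This is a sketch rather than the project's VHDL, and the helper names are hypothetical: each halfword is treated as a signed 16-bit value, and the sum is clamped to the range -32768 to 32767.

```python
def to_signed16(half: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement signed integer."""
    return half - 0x10000 if half & 0x8000 else half


def saturate16(value: int) -> int:
    """Clamp an integer to the signed 16-bit range."""
    return max(-32768, min(32767, value))


def add_halfwords_saturated(rs1: int, rs2: int) -> int:
    """Add the eight signed halfwords of two 128-bit values with saturation."""
    result = 0
    for i in range(8):
        a = to_signed16((rs1 >> (16 * i)) & 0xFFFF)
        b = to_signed16((rs2 >> (16 * i)) & 0xFFFF)
        result |= (saturate16(a + b) & 0xFFFF) << (16 * i)
    return result


# 0x7FFF + 0x7FFF saturates to 0x7FFF; 3 + 100 = 103 (0x67).
print(f"{add_halfwords_saturated(0x7FFF0003, 0x7FFF0064):08x}")  # 7fff0067
```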

## Multiply by sign saturated: mlhss 21 1 0



The value of each halfword in register 1 is multiplied by the sign of the corresponding halfword in register 0. For the second most significant halfword, 0x7FFF is multiplied by a positive sign, so the result is 0x7FFF. For the third most significant halfword, 150 (0x0096) is multiplied by a negative sign and becomes -150 (0xFF6A).
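The sign multiplication can be modelled in Python as below. This is a sketch with hypothetical names, and two behaviours are assumptions rather than statements from this report: a zero halfword in the sign operand produces zero, and negating -32768 saturates to 32767. The rs2 value in the usage line is likewise made up to give one positive and one negative sign.

```python
def to_signed16(half: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement signed integer."""
    return half - 0x10000 if half & 0x8000 else half


def multiply_by_sign_saturated(rs1: int, rs2: int) -> int:
    """Multiply each signed halfword of rs1 by the sign of the rs2 halfword."""
    result = 0
    for i in range(8):
        a = to_signed16((rs1 >> (16 * i)) & 0xFFFF)
        s = to_signed16((rs2 >> (16 * i)) & 0xFFFF)
        if s > 0:
            product = a
        elif s < 0:
            product = -a
        else:
            product = 0  # assumption: a zero sign operand yields zero
        product = max(-32768, min(32767, product))  # saturate to 16 bits
        result |= (product & 0xFFFF) << (16 * i)
    return result


# 0x7FFF times a positive sign stays 0x7FFF; 150 times a negative sign is 0xFF6A.
print(f"{multiply_by_sign_saturated(0x7FFF0096, 0x00018000):08x}")  # 7fffff6a
```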

## Invert bits: invb 22 1 0



This simply flips all of the bits in register 1 and stores the result in register 22.

## Rotate bits in word: rotw 23 3 9



This takes the least significant 5 bits of every word in register 9, rotates the bits in the corresponding word in register 3 to the right by that amount, and stores the result in register 23. The least significant word in register 3 is 0x0000015E, or 0b00000000000000000000000101011110. The least significant 5 bits of the corresponding word in register 9 are 00110, or 6. Rotating the word right by 6 bits gives 0b01111000000000000000000000000101, or 0x78000005.
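The rotation above can be verified with a small Python reference model. This is a sketch, not the project's VHDL, and `rotate_words_right` is a hypothetical name.

```python
WORD_MASK = 0xFFFFFFFF


def rotate_words_right(rs1: int, rs2: int) -> int:
    """Rotate each 32-bit word of rs1 right by the low 5 bits of the rs2 word."""
    result = 0
    for i in range(4):
        word = (rs1 >> (32 * i)) & WORD_MASK
        amount = (rs2 >> (32 * i)) & 0x1F  # rotate amount, 0 to 31
        # Bits shifted out on the right re-enter on the left.
        rotated = ((word >> amount) | (word << (32 - amount))) & WORD_MASK
        result |= rotated << (32 * i)
    return result


print(f"{rotate_words_right(0x0000015E, 6):08x}")  # 78000005
```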

## Subtract from word unsigned: sfwu 24 1 2



This subtracts the contents of each word in register 1 from the contents of the corresponding word in register 2, and stores the result in the corresponding word in register 24. Looking at the least significant word, register 2 holds 50 (0x32) and register 1 holds 3 (0x03), so the result is 47 (0x2F).

## Subtract from halfword saturated: sfhs 25 1 2



This subtracts each signed halfword of register 1 from the corresponding signed halfword of register 2 and stores the result in the corresponding halfword in register 25. You can see saturation occur in the second most significant halfword, where 32767 is subtracted from -2. The true result, -32769, is less than the minimum value of a 16-bit signed integer, so the result is 0x8000. Looking at the least significant halfword, 50 (0x32) minus 3 (0x03) is 47 (0x2F).
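The operand order and the saturation case can be checked with a Python reference model. This is a sketch with hypothetical names; it computes rs2 minus rs1 for each signed halfword and clamps the result to the 16-bit signed range.

```python
def to_signed16(half: int) -> int:
    """Interpret a 16-bit pattern as a two's-complement signed integer."""
    return half - 0x10000 if half & 0x8000 else half


def subtract_from_halfwords_saturated(rs1: int, rs2: int) -> int:
    """Compute rs2 - rs1 for each signed halfword, saturating to 16 bits."""
    result = 0
    for i in range(8):
        a = to_signed16((rs1 >> (16 * i)) & 0xFFFF)
        b = to_signed16((rs2 >> (16 * i)) & 0xFFFF)
        diff = max(-32768, min(32767, b - a))  # clamp to signed 16-bit range
        result |= (diff & 0xFFFF) << (16 * i)
    return result


# -2 - 32767 saturates to 0x8000; 50 - 3 = 47 (0x2F).
print(f"{subtract_from_halfwords_saturated(0x7FFF0003, 0xFFFE0032):08x}")  # 8000002f
```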

#### Conclusion:

We were able to verify that each individual component of the four-stage pipeline functions properly for a wide variety of inputs, including edge cases. When we put the components together into the pipeline, however, some errors started to appear. We found that a nop instruction issued immediately after another instruction could prevent that other instruction from completing. The example we noticed was a nop following an li instruction: the li instruction never wrote back to the register. The problem seemed to be a timing issue in the register module, as adding a couple of other relevant signals to the sensitivity list of the process fixed it. The order of our instructions also seemed to matter: when the same few li instructions were reordered, the registers in the register file ended up with different values, which did not make sense because the logic was the same.

Another issue came from attempting to use an output signal as an input after assigning a value to that output in the register module. Signals in VHDL cannot necessarily be treated like variables in many other programming languages, since a value assigned to a signal inside a process does not take effect until the process suspends.

The final issue that needed to be addressed before the entire design worked as intended was using the std\_match function in the ALU when checking the opcode input. Initially, we were using the '=' operator to compare an opcode to a vector that included "don't care" values. This did not work, because the '=' operator requires an exact match, including literal "don't care" values in the specified locations, whereas std\_match allows any value in the input vector to match a "don't care." With this fixed, we were able to see the entire pipelined processor perform operations as intended, with all timing working correctly, including data forwarding for registers that are modified and then read in back-to-back instructions.
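The difference between '=' and std\_match can be illustrated with a simplified Python model that treats vectors as strings of '0', '1' and '-'. This is only a sketch of the comparison semantics, not VHDL: the function names and the opcode pattern are hypothetical, and the model ignores the other std_logic values such as 'U' and 'X'.

```python
def exact_equal(value: str, pattern: str) -> bool:
    """Model of '=': every element must be identical, so '-' only equals '-'."""
    return value == pattern


def std_match_model(value: str, pattern: str) -> bool:
    """Simplified model of std_match: '-' in either operand matches any bit."""
    if len(value) != len(pattern):
        return False
    return all(v == p or v == "-" or p == "-" for v, p in zip(value, pattern))


opcode = "11010101"   # hypothetical opcode bits arriving at the ALU
pattern = "11--0101"  # hypothetical decode pattern with two don't-care bits

print(exact_equal(opcode, pattern), std_match_model(opcode, pattern))  # False True
```

The exact comparison fails because the literal '-' characters in the pattern are compared against '0' and '1', which is the behaviour that caused the opcode checks to miss.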